• Data leakage is a critical issue in data science: it occurs when information that would not be available at deployment time is used during the training or evaluation of a model. The result is overly optimistic performance metrics and, ultimately, poor model performance in real-world applications. The article presents three subtle examples of leakage encountered in various projects, illustrating how easily data handling in predictive modeling can go wrong.

• In the first example, the author worked with a company aiming to win sealed-bid auctions by predicting the price-to-beat. The company initially suggested filtering out lots priced above $1000 before building the model. The author recognized that this approach was flawed: the filter conditions the training data on the very quantity being predicted, a form of leakage, and it discards information relevant to prediction. Instead, they proposed training on all available data but reporting performance metrics only for lots predicted to fall below the $1000 threshold, since the true price is unknown at bid time. This adjustment allowed an honest assessment of the model's performance on the lots the company actually cared about. (A code sketch of this evaluation scheme appears below.)

• The second example involved a different company that wanted to model potential earnings from auctioned lots. The author initially planned to split training and testing data by random sampling, then realized that this would mix data from different time periods, effectively letting the model "time travel" and learn from the future. After investigating, the author found that while the conventional random split works adequately in some contexts, this dataset required a strict chronological split to produce a trustworthy performance estimate: similar lots were often sold in quick succession, so a random split scatters near-identical lots across train and test and inflates the measured performance. (A sketch contrasting the two splits appears below.)

• In the third example, the author identified leakage in a model designed to improve auction outcomes, proposed a fix they believed was leak-free, and later discovered that the fix itself introduced leakage. This experience underscored the importance of vigilance in detecting and addressing leakage, and the necessity of thoroughly understanding the data-generating process.

• The key takeaways: leakage always comes at a cost, though its significance varies with context; some leakage may be tolerable, but its potential impact must be assessed. The fact that a practice is common in the industry does not mean it is free from leakage. Detecting leakage is often easier than quantifying its effects, and sometimes the damage only becomes visible through the performance problems it causes. Overall, the discussion serves as a reminder of the complexities involved in data science and the importance of maintaining rigorous standards to avoid leakage, so that models perform reliably in real-world scenarios.
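To make the first example concrete, here is a minimal sketch of the "train on everything, report on the predicted subset" scheme. The article does not include code, so the model choice, feature matrix, and synthetic prices below are illustrative assumptions standing in for real lot data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for lot features and (skewed) sale prices.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))
y = 100 * np.exp(1.5 + X[:, 0] + 0.3 * rng.normal(size=n))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train on ALL lots. Filtering out lots above $1000 first would condition
# the dataset on the very quantity we are trying to predict.
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Report metrics only where the PREDICTION is below the threshold -- the
# true price is unknown at bid time, so it cannot define the subset.
pred = model.predict(X_te)
in_scope = pred < 1000
print(f"MAE on predicted-under-$1000 lots: "
      f"{mean_absolute_error(y_te[in_scope], pred[in_scope]):.2f}")
print(f"Share of test lots in scope: {in_scope.mean():.1%}")
```

The essential point is that the $1000 cutoff is applied to model output, which will be available at deployment, rather than to the ground-truth price, which will not.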
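The second example can be illustrated the same way. The data-generating process below is an assumption built to mimic the article's description: lots arrive in runs of near-identical items, so a random split places siblings of each test lot in the training set, while a chronological split keeps whole runs out of training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n, run = 4000, 20                                # 200 runs of 20 similar lots
latent = np.repeat(rng.normal(size=n // run), run)       # per-run item type
quirk = np.repeat(30 * rng.normal(size=n // run), run)   # per-run price quirk
X = (latent + 0.05 * rng.normal(size=n)).reshape(-1, 1)
y = 500 + 200 * latent + quirk + 10 * rng.normal(size=n)

def mae(train_idx, test_idx):
    m = RandomForestRegressor(n_estimators=200, random_state=0)
    m.fit(X[train_idx], y[train_idx])
    return mean_absolute_error(y[test_idx], m.predict(X[test_idx]))

cut = 3 * n // 4

# Random split: lots from the same run straddle the train/test boundary,
# so the model can memorize each run's quirk -- an optimistically low error.
perm = rng.permutation(n)
print("random split MAE:       ", round(mae(perm[:cut], perm[cut:]), 1))

# Chronological split: test runs are entirely unseen, as in deployment.
order = np.arange(n)
print("chronological split MAE:", round(mae(order[:cut], order[cut:]), 1))
```

On data shaped like this, the random split reports a noticeably lower error than the chronological one, even though only the chronological estimate reflects what the model will face in production.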